```r
# A tibble: 4 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
```
Suppose that the variable hwy (fuel efficiency on the highway) is very expensive to measure.
We decide to estimate it using the other variables. To do so, we will fit a regression model. \[
\text{hwy} \approx \text{model}(\text{other variables})
\]
## Simple regression
We expect the variable cty to be a good proxy for hwy.
After all, if a car is efficient in the city, we expect it to also be efficient on the highway! We will therefore consider a simple regression model
\[
\text{hwy} \approx \text{model}(\text{cty})
\]
## Simple linear regression

```r
ggplot(d) +
  geom_point(aes(x = cty, y = hwy))
```
It turns out that the variables cty and hwy are linearly associated.
We therefore decide to use a **simple linear regression** model.
We can use our models to estimate hwy for new vehicles.
Imagine there is a new vehicle with \(\text{cty} = 30\). Instead of measuring its hwy, we decide to use our model to estimate it. Using the "good" model gives the following estimate: \begin{align*} \widehat{\text{hwy}} &= \beta_0 + \beta_1 \cdot \text{cty}\\ &= 1 + 1.3 \times 30\\ &= 40 \end{align*}
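This back-of-the-envelope prediction can be checked directly in R, plugging the slope and intercept of the "good" model into the regression equation:

```r
# Parameters of the "good" model from the slides
beta_0 <- 1
beta_1 <- 1.3

# Predicted hwy for a new vehicle with cty = 30
hwy_hat <- beta_0 + beta_1 * 30
hwy_hat
#> 40
```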
Is this the true hwy of the new vehicle? No! This is only an estimate based on the value of the variable cty and our "good" model.
Can we do better? Yes!

- Take additional variables into account in the model (e.g., engine size, vehicle age, etc.)
- Use better values for \(\beta_0\) and \(\beta_1\).
## Group exercise - parameters
What is the prediction for the new vehicle if we use the bad model?
Copy and paste the following piece of code and try different values for the parameters to find a good set of values.
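The exact code from the slides is not reproduced here, but such an experiment could be sketched as follows (assuming `d` is the mpg data set from ggplot2): pick candidate parameter values, compute the sum of squared residuals, and try to make it small.

```r
library(ggplot2)  # provides the mpg data set
d <- mpg

# Candidate parameter values -- try changing these!
b0 <- 1
b1 <- 1.3

# Sum of squared residuals for this choice of parameters
ssr <- sum((d$hwy - (b0 + b1 * d$cty))^2)
ssr  # smaller is better
```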
A simple linear regression model is only applicable if the relationship between the predictor and the response is linear. If the relationship is not linear, a simple linear regression is not suitable: we need to model the non-linearity (next lecture).
## Group exercise - linear association
Exercise 7.3
## Least-squares estimates

## Residuals
Our predictions are only approximate.
Let us represent our prediction with \(\widehat{\text{hwy}}\) and the true value with \(\text{hwy}\).

- The error we make is \(\text{hwy} - \widehat{\text{hwy}}\).
- This is the **residual**: \[e = \text{hwy} - \widehat{\text{hwy}}\]
## Visualizing residuals

- Black circles: observed values (y = hwy)
- Pink solid line: least-squares regression line
- Maroon triangles: predicted values (y = .fitted)
- Gray lines: residuals
## Residual plot

```r
d %>%
  mutate(
    hwy_hat = 1 + 1.3 * cty,
    resid   = hwy - hwy_hat
  ) %>%
  ggplot() +
  geom_point(aes(cty, resid)) +
  geom_abline(intercept = 0, slope = 0, col = "red")
```
## Group exercise - residuals
Exercises 7.1, 7.17, 7.19
## Good estimates

- We want to choose estimates that give accurate predictions.
- We want to minimize the residuals.
Perhaps the most natural thing to do is to find the values for \(\beta_0\) and \(\beta_1\) that minimize the sum of absolute residuals \[|e_1|+|e_2|+\dots+|e_n|\] For practical reasons, the sum of squared residuals is a more common criterion: \[e_1^2+e_2^2+\dots+e_n^2\]

## Why squaring the residuals?
- It can be done by hand (pre-computer).
- It reflects the assumption that being off by \(4\) is more than twice as bad as being off by \(2\).
- It has nice mathematical properties.
- It is the mainstream choice.
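A quick numeric illustration of the second point: under absolute loss an error of \(4\) is exactly twice as bad as an error of \(2\), while under squared loss it is four times as bad.

```r
# Being off by 4 versus being off by 2
abs(4) / abs(2)  # absolute loss: twice as bad
4^2 / 2^2        # squared loss: four times as bad
```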
## Least-squares estimates

We find the values for \(\beta_0\) and \(\beta_1\) that minimize the SSR with the R function `lm`.
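For example, with the mpg data (here called `d`, as elsewhere in the slides), the least-squares estimates are the fitted coefficients:

```r
library(ggplot2)  # provides the mpg data set
d <- mpg

m <- lm(hwy ~ cty, data = d)
coef(m)  # beta_0 (intercept) and beta_1 (slope of cty)
```
The slope for cty comes out positive, consistent with the intuition that city-efficient cars are also highway-efficient.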
Note that the slope coefficient is negative, which makes sense: cars with larger engines tend to be less efficient.
We now have two models. Which one is better?
We could start by looking at the residuals
## Comparing residuals

```r
library(broom)  # for augment()

m <- lm(hwy ~ cty, data = d)
m_augment <- augment(m)

ggplot(m_augment) +
  geom_histogram(aes(.resid))
```
```r
lm(hwy ~ displ, data = d) %>%
  augment() %>%
  ggplot() +
  geom_histogram(aes(.resid))
```
The first model seems to have residuals of smaller magnitude.
But we need a more systematic approach for comparing models:

- looking at a plot can be misleading (illusions)
- it is difficult to compare models with similar residuals
## SSR

Instead of comparing histograms to identify the model with the smaller residuals, we can compute the SSR (the sum of squared residuals):

- small residuals give a small SSR
- large residuals give a large SSR

\[
SSR = e_1^2 + e_2^2 + \dots + e_n^2
\]
Simply choose the model with the smaller SSR!
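In R, the SSR of a fitted `lm` model is returned by `deviance()`, so the two models can be compared directly (sketch using the mpg data as `d`):

```r
library(ggplot2)  # provides the mpg data set
d <- mpg

m1 <- lm(hwy ~ cty,   data = d)
m2 <- lm(hwy ~ displ, data = d)

# deviance() returns the sum of squared residuals for an lm fit
c(ssr_cty = deviance(m1), ssr_displ = deviance(m2))
```
The cty model has the smaller SSR, matching what the residual histograms suggested.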
📋 The textbook uses the term SSE (sum of squared errors).
## \(R^2\)
The SSR is also useful in describing the strength of the model.
The SST (total sum of squares) is the sum of squared distances to the mean. \[
SST = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2
\] It measures the total amount of variability in the data.
Remember the formula for the SSR: \[
SSR = (x_1 - \hat{x}_1)^2 + (x_2 - \hat{x}_2)^2 + \dots + (x_n - \hat{x}_n)^2
\] It is the amount of variability in the data left unexplained by the model.
By analogy, \(SST - SSR\) is the amount of variation explained by the model: \[
\text{data} = SST = SSR + (SST-SSR) = \text{residuals} + \text{model}
\]
The coefficient of determination \(R^2\) measures the proportion of variation in the data that is explained by the model: \[
R^2 = \frac{SST - SSR}{SST} = 1 - \frac{SSR}{SST}
\]
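In R, \(R^2\) is reported by `summary()`; it can also be computed by hand from the SSR and SST, and the two values agree (sketch using the mpg data as `d`):

```r
library(ggplot2)  # provides the mpg data set
d <- mpg

m   <- lm(hwy ~ cty, data = d)
ssr <- deviance(m)                   # sum of squared residuals
sst <- sum((d$hwy - mean(d$hwy))^2)  # total sum of squares

r2_by_hand <- 1 - ssr / sst
r2_from_lm <- summary(m)$r.squared
c(r2_by_hand, r2_from_lm)
```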
## Outliers

Remember: in a boxplot, outliers are observations far from the bulk of the data.
```r
ggplot(d) +
  geom_boxplot(aes(hwy))
```
In the context of a regression model, an outlier is an observation that falls far from the cloud of points.
## Group exercise - outliers
Look at the scatterplot for our data. Are there outliers?
```r
ggplot(d) +
  geom_point(aes(cty, hwy))
```
Exercise 7.25
## Outliers, leverage and influential points

- **Outlier:** an observation that falls far from the cloud of points
- **High leverage point:** an observation that falls horizontally away from the cloud of points
- **Influential point:** an observation that influences the slope of the regression line

All influential points are high leverage points. All high leverage points are outliers.
(Venn Diagram)
## Dealing with outliers

In regression, outliers can strongly influence the least-squares estimates: the least-squares estimates are not robust to the presence of outliers.
## Impact of outliers

Let us contaminate the data with an outlier (\(\text{cty} = 10\) and \(\text{hwy} = 1000\)) and fit the same regression model.
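A sketch of what this contamination could look like in code (variable names are illustrative; `d` is the mpg data as elsewhere in the slides):

```r
library(ggplot2)  # provides the mpg data set
d <- mpg

# Add a single extreme observation: cty = 10, hwy = 1000
d_contaminated <- rbind(
  d[, c("cty", "hwy")],
  data.frame(cty = 10, hwy = 1000)
)

coef(lm(hwy ~ cty, data = d))              # original fit
coef(lm(hwy ~ cty, data = d_contaminated)) # fit after contamination
```
A single contaminated observation is enough to change the least-squares slope noticeably.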